I am using data from the abc_poll_2021 data set. This data set contains information related to a national survey conducted in 2021. There are 31 variables and 527 rows of data, with each row representing a unique observation. Of these 31 variables, it appears that 16 capture participant-level information, 12 capture information related to survey questions, and 3 capture information related to survey administration.
Generated by summarytools 1.0.1 (R version 4.2.1) 2022-12-20
There were several columns whose values were captured more precisely in re-coded variables. For example, the variable ‘ABCAGE’ captured respondents’ age ranges, while ‘ppage’ captured respondents’ exact ages. So, I removed columns (‘ABCAGE’, ‘ppemploy’, and ‘ppeducat’) that had been re-coded to more accurately capture respondent demographic data. My clean data set now has 28 columns (variables) and 527 rows of data.
We can see from the summary data frame that there are some variables whose text can be simplified. We can also see that the content questions offer a ‘skip’ option, so we should see if we can mutate the values in those columns so that ‘skipped’ questions equate to missing values.
Identify variables that need to be mutated
Remove articles from variable ‘QPID’ (respondent party ID):
Code
# remove articles from QPID (party ID):ABC_mutate <- ABC_clean%>%mutate(partyid =str_remove(QPID, "A[n]*"))%>%select(-QPID)# check to ensure articles have been removed:table(ABC_mutate$partyid)
# mutate so 'skipped' = missing valueABC_mutate <- ABC_clean%>%mutate(across(starts_with("Q"), ~na_if(.x, "Skipped")))# apply mutation across variables starting with 'Q' and check to ensure 'skipped' has been replaced with a missing value:map(select(ABC_mutate, starts_with("Q")), table)
$Q1_a
Approve Disapprove
329 193
$Q1_b
Approve Disapprove
192 322
$Q1_c
Approve Disapprove
272 248
$Q1_d
Approve Disapprove
192 321
$Q1_e
Approve Disapprove
212 301
$Q1_f
Approve Disapprove
281 230
$Q2
Not concerned at all Not so concerned Somewhat concerned
65 147 221
Very concerned
94
$Q3
No Yes
107 415
$Q4
Excellent Good Not so good Poor
60 215 97 149
$Q5
Optimistic Pessimistic
229 295
$QPID
A Democrat A Republican An Independent Something else
176 152 168 28
Clean up “ppethm” column values:
Code
# find unique values in "ppethm" column:unique(ABC_mutate["ppethm"])
# mutate so that values are more succinct:ABC_mutate <- ABC_mutate%>%mutate(ethnicity =str_remove(ppethm, ", Non-Hispanic"))%>%select(-ppethm)# check new values:select(ABC_mutate, "ethnicity")
# A tibble: 527 × 1
ethnicity
<chr>
1 White
2 White
3 White
4 White
5 White
6 White
7 Other
8 Black
9 White
10 Other
# … with 517 more rows
# identify values in "resp_education" column:table(ABC_mutate_clean$resp_education)
Master\x92s degree or above
99
Bachelor\x92s degree
108
High school graduate (high school diploma or the equivalent GED)
133
No high school diploma or GED
29
Some college or Associate degree
158
[1] "High school graduate (high school diploma or the equivalent GED)"
[2] "Bachelor\x92s degree"
[3] "Master\x92s degree or above"
[4] "Some college or Associate degree"
[5] "No high school diploma or GED"
Code
# mutate and factor so that respondent education becomes ordinal (no high school, high school, some college, etc.):ABC_mutate_clean <- ABC_mutate_clean%>%mutate(resp_edu =factor(resp_education, levels=resp_edu_order[c(5,1,4,2,3)]))%>%select(-resp_education)rm(resp_edu_order)table(ABC_mutate_clean$resp_edu)
No high school diploma or GED
29
High school graduate (high school diploma or the equivalent GED)
133
Some college or Associate degree
158
Bachelor\x92s degree
108
Master\x92s degree or above
99
Source Code
---title: "Challenge 4 Solutions"author: "Caitlin Rowley"desription: "More data wrangling: pivoting"date: "10/18/2022"format: html: toc: true code-fold: true code-copy: true code-tools: truecategories: - challenge_4 - abc_poll---```{r}#| label: setup#| warning: false#| message: falseinstall.packages("summarytools")library(tidyverse)library(lubridate)library(readxl)library(summarytools)knitr::opts_chunk$set(echo =TRUE, warning=FALSE, message=FALSE)```## Challenge OverviewToday's challenge is to:1) read in a data set, and describe the data set using both words and any supporting information (e.g., tables, etc)2) tidy data (as needed, including sanity checks)3) identify variables that need to be mutated4) mutate variables and sanity check all mutations```{r}# read in data:ABC <-read_csv("_data//abc_poll_2021.csv")```### Briefly describe the data```{r}# id column namescolnames(ABC)# break down question/variable types:ABC %>%select(starts_with("pp"))%>%colnames(.)# add 'ABCage' and 'xspanish' to demo questionsABC %>%select(starts_with("Q"))%>%colnames(.)# add 'contact' to survey questions# find duplicates:unique(ABC)# find missing values:missing <-is.na(ABC)missing["TRUE"]```I am using data from the abc_poll_2021 data set. This data set contains information related to a national survey conducted in 2021. There are 31 variables and 527 rows of data, with each row representing a unique observation. Of these 31 variables, it appears that 16 capture participant-level information, 12 capture information related to survey questions, and 3 capture information related to survey administration.```{r}#continue cleaning:# remove 'ABCAGE', 'ppemploy' 'ppeducat'ABC_clean <-subset(ABC, select =-c(ABCAGE, ppemploy, ppeducat))print(ABC_clean)# print data frame summary:print(summarytools::dfSummary(ABC_clean,varnumbers =FALSE,plain.ascii =FALSE, style ="grid", graph.magnif =0.75, valid.col =FALSE),method ='render',table.classes ='table-condensed')```There were several columns whose values were captured more precisely in re-coded variables. For example, the variable 'ABCAGE' captured respondents' age ranges, while 'ppage' captured respondents' exact ages. So, I removed columns ('ABCAGE', 'ppemploy', and 'ppeducat') that had been re-coded to more accurately capture respondent demographic data. My clean data set now has 28 columns (variables) and 527 rows of data.We can see from the summary data frame that there are some variables whose text can be simplified. We can also see that the content questions offer a 'skip' option, so we should see if we can mutate the values in those columns so that 'skipped' questions equate to missing values.## Identify variables that need to be mutatedRemove articles from variable 'QPID' (respondent party ID):```{r}# remove articles from QPID (party ID):ABC_mutate <- ABC_clean%>%mutate(partyid =str_remove(QPID, "A[n]*"))%>%select(-QPID)# check to ensure articles have been removed:table(ABC_mutate$partyid)```Replace 'skipped' values with missing values:```{r}# mutate so 'skipped' = missing valueABC_mutate <- ABC_clean%>%mutate(across(starts_with("Q"), ~na_if(.x, "Skipped")))# apply mutation across variables starting with 'Q' and check to ensure 'skipped' has been replaced with a missing value:map(select(ABC_mutate, starts_with("Q")), table)```Clean up "ppethm" column values:```{r}# find unique values in "ppethm" column:unique(ABC_mutate["ppethm"])# mutate so that values are more succinct:ABC_mutate <- ABC_mutate%>%mutate(ethnicity =str_remove(ppethm, ", Non-Hispanic"))%>%select(-ppethm)# check new values:select(ABC_mutate, "ethnicity")```Rename remaining variables:```{r}ABC_mutate_clean <-rename(ABC_mutate,"resp_id"="id","resp_language"="xspanish", "complete_status"="complete_status", "resp_age"="ppage", "resp_education"="ppeduc5", "resp_gender"="ppgender", "resp_household_size"="pphhsize", "resp_income"="ppinc7", "resp_marital_status"="ppmarit5", "resp_metro"="ppmsacat", "resp_region"="ppreg4", "resp_rent/own"="pprent", "resp_state"="ppstaten", "resp_employment"="PPWORKA", "resp_party_ID"="QPID", "resp_interview"="Contact", "resp_party_ID_weight"="weights_pid",)colnames(ABC_mutate_clean)```Try factor order:```{r}# identify values in "resp_education" column:table(ABC_mutate_clean$resp_education)# identify unique values:resp_edu_order <-unique(ABC_mutate_clean$resp_education)resp_edu_order# mutate and factor so that respondent education becomes ordinal (no high school, high school, some college, etc.):ABC_mutate_clean <- ABC_mutate_clean%>%mutate(resp_edu =factor(resp_education, levels=resp_edu_order[c(5,1,4,2,3)]))%>%select(-resp_education)rm(resp_edu_order)table(ABC_mutate_clean$resp_edu)```